Analysis For Instagram Data¶

In this project, I will analyze Instagram post data.

Instagram Data Field Description¶


  • Below is a description of each column in the dataset:

Impressions: Total number of views the post received (measures reach)

From Home: Views from the home feed (traffic source)

From Hashtags: Views coming through hashtags (hashtag effectiveness)

From Explore: Views from the Explore page (discoverability to new users)

From Other: Views from other sources (e.g. stories, direct shares)

Saves: Number of saves (indicates valuable content)

Comments: Number of comments (engagement and discussion)

Shares: Number of times the post was shared (how shareable the content is)

Likes: Number of likes (engagement metric)

Profile Visits: Visits to the profile from this post (interest in the account)

Follows: New followers gained from the post (growth metric)

Caption: Text content of the post (can be analyzed for keywords or tone)

Hashtags: Hashtags used in the post (affects discoverability and reach)

Questions to Be Answered by the Analysis¶

  • Does the view_count affect the number of comments?
  • Does the number of new followers affect the number of likes?
  • Do profile visits lead to more followers?
  • Do posts with higher saves also receive more profile visits or follows?
  • What is the relationship between impressions and likes, comments, and shares?
  • What is the engagement rate per impression for each post?
  • Is there a relationship between the number of hashtags used and the impressions from hashtags?
In [1]:
##load needed Modules
import pandas as pd
In [2]:
## display all data columns 
pd.options.display.max_columns=None
In [3]:
## load the dataset into DataFrame 
df=pd.read_csv(r"C:\Users\DR SYSTEM\Downloads\Instagram data.csv")
In [4]:
## display first rows 
df.head(2)
Out[4]:
Impressions From Home From Hashtags From Explore From Other Saves Comments Shares Likes Profile Visits Follows Caption Hashtags
0 3920 2586 1028 619 56 98 9 5 162 35 2 Here are some of the most important data visua... #finance #money #business #investing #investme...
1 5394 2727 1838 1174 78 194 7 14 224 48 10 Here are some of the best data science project... #healthcare #health #covid #data #datascience ...
In [5]:
## check for dataframe shape
df.shape
Out[5]:
(119, 13)

The dataset has 119 rows and 13 columns.

In [6]:
## check for data info (quality)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Impressions     119 non-null    int64 
 1   From Home       119 non-null    int64 
 2   From Hashtags   119 non-null    int64 
 3   From Explore    119 non-null    int64 
 4   From Other      119 non-null    int64 
 5   Saves           119 non-null    int64 
 6   Comments        119 non-null    int64 
 7   Shares          119 non-null    int64 
 8   Likes           119 non-null    int64 
 9   Profile Visits  119 non-null    int64 
 10  Follows         119 non-null    int64 
 11  Caption         119 non-null    object
 12  Hashtags        119 non-null    object
dtypes: int64(11), object(2)
memory usage: 12.2+ KB
  • Some columns need to be dropped (['From Other'])
  • Some columns need to be renamed to be more readable
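
A minimal sketch of both clean-up steps on a tiny made-up frame rather than the real file. The `to_snake_case` helper and the sample values are illustrative assumptions and are not part of the notebook, which only renames `Impressions` in a later cell.

```python
import pandas as pd

def to_snake_case(name: str) -> str:
    """Hypothetical helper: 'Profile Visits' -> 'profile_visits'."""
    return name.strip().lower().replace(" ", "_")

# Made-up sample with the same style of column names as the dataset.
sample = pd.DataFrame({
    "Impressions": [3920, 5394],
    "From Home": [2586, 2727],
    "From Other": [56, 78],
    "Profile Visits": [35, 48],
})

cleaned = (
    sample
    .drop(columns=["From Other"])      # drop the column flagged above
    .rename(columns=to_snake_case)     # rename every remaining column
)
print(list(cleaned.columns))
```

Both calls return a new DataFrame, so `sample` is left unchanged.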

Feature Engineering¶

  • add 3 columns for the traffic-source ratios: home_ratio, hashtag_ratio, explore_ratio
  • add 4 columns for the engagement ratios: like_ratio, comment_ratio, share_ratio, save_ratio
  • add a hashtag count column (Hashtag_count)
  • add a profile visit rate column (profile_visit_rate)
In [7]:
## copy the dataframe 
df_copy=df.copy()
In [8]:
## check for duplicates 
df.duplicated().sum()
Out[8]:
17

There are 17 duplicated rows.

In [9]:
## drop duplicates 
df.drop_duplicates(inplace=True)
In [10]:
#check
df.duplicated().sum()
Out[10]:
0
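
A small sketch of how the duplicate check and removal behave, on made-up rows. It also shows that `drop_duplicates` keeps the original index labels, which would explain why later outputs report 102 rows while the index still runs up to 118. The `reset_index` step is an optional extra that the notebook does not perform.

```python
import pandas as pd

# Made-up frame: rows 1 and 3 repeat row 0.
posts = pd.DataFrame({"Impressions": [100, 100, 250, 100],
                      "Likes": [10, 10, 30, 10]})

n_duplicates = int(posts.duplicated().sum())   # counts repeats, not first occurrences
deduped = posts.drop_duplicates()               # keeps the first occurrence of each row
renumbered = deduped.reset_index(drop=True)     # optional: close the gaps in the index

print(n_duplicates)             # 2
print(list(deduped.index))      # [0, 2]
print(list(renumbered.index))   # [0, 1]
```
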
In [11]:
#check for null values
df.isnull().sum()
Out[11]:
Impressions       0
From Home         0
From Hashtags     0
From Explore      0
From Other        0
Saves             0
Comments          0
Shares            0
Likes             0
Profile Visits    0
Follows           0
Caption           0
Hashtags          0
dtype: int64
In [12]:
## list columns to be renamed 
df.columns
Out[12]:
Index(['Impressions', 'From Home', 'From Hashtags', 'From Explore',
       'From Other', 'Saves', 'Comments', 'Shares', 'Likes', 'Profile Visits',
       'Follows', 'Caption', 'Hashtags'],
      dtype='object')
In [13]:
# rename desired columns 
df.rename(columns={'Impressions':'view_count'}, inplace=True)
In [14]:
# check dataframe 
df.head(1)
Out[14]:
view_count From Home From Hashtags From Explore From Other Saves Comments Shares Likes Profile Visits Follows Caption Hashtags
0 3920 2586 1028 619 56 98 9 5 162 35 2 Here are some of the most important data visua... #finance #money #business #investing #investme...
In [15]:
## Proportion of view_count coming from Home , hashtags , explore
df['home_ratio']=df['From Home'] / df['view_count']
df['hashtag_ratio']=df['From Hashtags'] / df['view_count']
df['explore_ratio']=df['From Explore'] / df['view_count']
In [16]:
df['home_ratio']
df['hashtag_ratio']
df['explore_ratio']
Out[16]:
0      0.157908
1      0.217649
2      0.000000
3      0.205830
4      0.110802
         ...   
114    0.390657
115    0.395393
116    0.330273
117    0.532620
118    0.445408
Name: explore_ratio, Length: 102, dtype: float64

Some posts receive more than 50% of their impressions from the Explore page (explore_ratio above 0.5).
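
To turn "some posts" into a count, a boolean filter on the ratio is enough. The sketch below uses made-up numbers; on the real data the same filter would be applied to `df['explore_ratio']`.

```python
import pandas as pd

# Made-up values for illustration only.
demo = pd.DataFrame({"view_count": [1000, 2000, 4000],
                     "From Explore": [100, 1200, 1000]})
demo["explore_ratio"] = demo["From Explore"] / demo["view_count"]

explore_heavy = demo[demo["explore_ratio"] > 0.5]   # posts mostly seen via Explore
share_of_posts = len(explore_heavy) / len(demo)

print(len(explore_heavy), round(share_of_posts, 2))
```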

In [17]:
## add like_ratio, comment_ratio, share_ratio (save_ratio is not computed here)
df['like_ratio']=df['Likes'] / df['view_count']
df['comment_ratio']=df['Comments'] / df['view_count']
df['share_ratio']=df['Shares'] / df['view_count']
In [18]:
df['like_ratio'] 
df['comment_ratio'] 
df['share_ratio']  # only this last expression is displayed
Out[18]:
0      0.001276
1      0.002595
2      0.000249
3      0.001546
4      0.001589
         ...   
114    0.002774
115    0.000174
116    0.000242
117    0.002294
118    0.000704
Name: share_ratio, Length: 102, dtype: float64

The share_ratio is very low across all posts (at most about 0.01, i.e. 1% of views).
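
The feature list also mentions a save ratio, which the cells above do not compute. A sketch of how it could be derived with the same pattern, using the first two rows shown in the `df.head(2)` output; the column name `save_ratio` is taken from the feature list, everything else is illustrative.

```python
import pandas as pd

# First two rows of the dataset as shown in the df.head(2) output.
rows = pd.DataFrame({"view_count": [3920, 5394], "Saves": [98, 194]})

# Same pattern as like_ratio, comment_ratio and share_ratio.
rows["save_ratio"] = rows["Saves"] / rows["view_count"]
print(rows["save_ratio"].round(4).tolist())
```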

In [19]:
## add column (hashtag_count)
def hashtag_count(text): 
    if pd.isna(text):
        return 0
    return text.count('#')
In [20]:
df['Hashtag_count']=df['Hashtags'].apply(hashtag_count)
df['Hashtag_count'].head()
Out[20]:
0    22
1    18
2    18
3    11
4    29
Name: Hashtag_count, dtype: int64
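
Counting `#` characters also counts stray hashes and repeated tags. As a hedged alternative, the sketch below counts hashtag tokens with a regular expression. `count_hashtags` and its `unique` flag are hypothetical names, not part of the notebook.

```python
import re

HASHTAG = re.compile(r"#\w+")   # '#' followed by at least one word character

def count_hashtags(text, unique=False):
    """Hypothetical variant of hashtag_count that counts hashtag tokens."""
    if not isinstance(text, str):    # covers None and NaN
        return 0
    tags = [t.lower() for t in HASHTAG.findall(text)]
    return len(set(tags)) if unique else len(tags)

example = "#data #DataScience #data # stray"
print(example.count("#"))                    # 4
print(count_hashtags(example))               # 3
print(count_hashtags(example, unique=True))  # 2
```
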
In [21]:
## add column profile visit rate 
df['profile_visit_rate']=df['Profile Visits'] /df['view_count'] *100 
In [22]:
df['profile_visit_rate']
Out[22]:
0      0.892857
1      0.889878
2      1.541905
3      0.507951
4      0.317712
         ...   
114    0.532847
115    0.348979
116    0.821454
117    0.452669
118    1.654974
Name: profile_visit_rate, Length: 102, dtype: float64

This column expresses profile visits as a percentage of total views for each post.
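
As a quick arithmetic check, the first two values can be reproduced by hand from the Profile Visits and view counts shown in the `df.head(2)` output:

```python
# (profile_visits, view_count) for the first two posts, from the df.head(2) output.
posts = [(35, 3920), (48, 5394)]

# Percent of views that led to a profile visit.
rates = [visits / views * 100 for visits, views in posts]

for (visits, views), rate in zip(posts, rates):
    print(f"{visits}/{views} * 100 = {rate:.6f}")
```

The printed values match the first two entries of `profile_visit_rate` (0.892857 and 0.889878).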

In [23]:
#check dataframe 
df.head(2)
Out[23]:
view_count From Home From Hashtags From Explore From Other Saves Comments Shares Likes Profile Visits Follows Caption Hashtags home_ratio hashtag_ratio explore_ratio like_ratio comment_ratio share_ratio Hashtag_count profile_visit_rate
0 3920 2586 1028 619 56 98 9 5 162 35 2 Here are some of the most important data visua... #finance #money #business #investing #investme... 0.659694 0.262245 0.157908 0.041327 0.002296 0.001276 22 0.892857
1 5394 2727 1838 1174 78 194 7 14 224 48 10 Here are some of the best data science project... #healthcare #health #covid #data #datascience ... 0.505562 0.340749 0.217649 0.041528 0.001298 0.002595 18 0.889878
In [39]:
df.describe().round(2)
Out[39]:
view_count From Home From Hashtags From Explore From Other Saves Comments Shares Likes Profile Visits Follows home_ratio hashtag_ratio explore_ratio like_ratio comment_ratio share_ratio Hashtag_count profile_visit_rate engagement rate (%)
count 102.00 102.00 102.00 102.00 102.00 102.00 102.00 102.00 102.00 102.00 102.00 102.00 102.00 102.00 102.00 102.0 102.00 102.00 102.00 102.00
mean 5920.25 2496.91 1968.28 1178.57 184.55 156.55 6.35 9.30 176.82 54.67 22.82 0.50 0.32 0.13 0.03 0.0 0.00 18.55 0.75 6.32
std 5139.89 1588.38 1977.30 2797.21 309.10 157.77 3.31 10.15 85.15 93.17 43.69 0.17 0.18 0.14 0.01 0.0 0.00 4.80 0.59 2.05
min 1941.00 1133.00 116.00 0.00 9.00 22.00 0.00 0.00 72.00 4.00 0.00 0.10 0.03 0.00 0.01 0.0 0.00 10.00 0.16 3.05
25% 3556.00 1923.75 753.00 178.75 40.25 70.50 4.00 3.00 122.00 16.00 4.00 0.38 0.19 0.04 0.03 0.0 0.00 17.00 0.39 4.77
50% 4343.50 2216.00 1326.00 337.00 75.00 111.00 6.00 6.50 157.50 24.00 8.00 0.49 0.30 0.08 0.03 0.0 0.00 18.00 0.50 6.22
75% 6296.25 2605.25 2415.75 728.50 218.50 173.50 8.00 13.00 208.75 45.75 18.00 0.62 0.44 0.15 0.04 0.0 0.00 20.00 0.95 7.47
max 36919.00 13473.00 11817.00 17414.00 2547.00 1095.00 19.00 75.00 549.00 611.00 260.00 0.92 0.74 0.70 0.05 0.0 0.01 30.00 3.17 13.03

Q1: Does the view_count affect the number of comments?¶

In [25]:
# load needed modules
import seaborn as sns 
import matplotlib.pyplot as plt
In [26]:
sns.regplot(data=df, x='view_count' , y='Comments')
plt.title('view_count Vs Comment')
plt.ylabel('Comment')
plt.xlabel('view_count')
plt.show()
[Figure: regression plot of view_count (x) against Comments (y)]

There is no strong correlation between the number of views and the number of comments on the posts in this dataset.

Q2: Does the number of new followers affect the number of likes?¶

In [27]:
# load needed modules
import plotly.express as px
In [28]:
px.scatter(df,x='Follows' , y='Likes' , trendline='ols')

There is a strong positive correlation between the number of likes and the number of new followers in this dataset.

Q3: Do profile visits lead to more followers?¶

In [29]:
# load needed modules
import plotly.express as px
In [30]:
px.scatter(df, x='Profile Visits' , y='Follows' , trendline='ols')
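
The notebook records no conclusion for this chart. One way to put numbers on the trendline, sketched here with NumPy on made-up values, is to fit a least-squares line and compute the squared Pearson correlation. This avoids the `statsmodels` dependency that Plotly's `trendline='ols'` relies on. The arrays below are placeholders for `df['Profile Visits']` and `df['Follows']`.

```python
import numpy as np

# Made-up values; replace with df['Profile Visits'] and df['Follows'].
profile_visits = np.array([10, 20, 30, 40, 50], dtype=float)
follows = np.array([2, 5, 6, 9, 11], dtype=float)

slope, intercept = np.polyfit(profile_visits, follows, deg=1)  # least-squares line
r = np.corrcoef(profile_visits, follows)[0, 1]                 # Pearson r

print(f"slope={slope:.3f}, intercept={intercept:.3f}, r^2={r**2:.3f}")
```

The slope reads as "additional follows per additional profile visit", and r² as the share of variance in follows explained by a straight line.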

Q4: Do posts with higher saves also receive more profile visits or follows?¶

In [31]:
# load needed modules
import plotly.express as px
In [32]:
px.scatter(df,x='Saves',y='Follows',trendline='ols')

There is a strong positive correlation between the number of saves and the number of new followers.

In [33]:
px.scatter(df,x='Saves',y='Profile Visits',trendline='ols')

There is also a positive but weaker correlation between saves and profile visits.

Q5: What is the relationship between impressions and likes, comments, and shares?¶

In [34]:
# load needed modules
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt
In [35]:
columns=['view_count','Likes','Comments','Shares']
correlation=df[columns].corr()
print(correlation)

sns.heatmap(correlation,annot=True)
plt.title('The relationship between impressions and engagement')
plt.show()
            view_count     Likes  Comments    Shares
view_count    1.000000  0.852952 -0.008535  0.654920
Likes         0.852952  1.000000  0.163383  0.718790
Comments     -0.008535  0.163383  1.000000  0.012697
Shares        0.654920  0.718790  0.012697  1.000000
[Figure: annotated correlation heatmap for view_count, Likes, Comments and Shares]

Strong correlation between view_count and Likes (0.85): posts with more views generally receive more likes.

Moderate correlation between view_count and Shares (0.65): posts with more views tend to be shared more.

No meaningful correlation between view_count and Comments (-0.0085): comments do not necessarily increase with views.

Likes and Shares are highly correlated (0.72): posts that receive many likes also tend to be shared often.

Comments are not strongly correlated with any other metric in this matrix (the largest value is 0.16, with Likes).
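
The `describe()` output shows a heavily skewed view_count (median 4343.5, maximum 36919), and Pearson correlations can be dominated by a few very large posts. A rank-based check is one way to see how much a coefficient depends on them. The sketch below uses made-up data and computes Spearman's rho as the Pearson correlation of the ranks.

```python
import pandas as pd

# Made-up data with one very large post, mimicking the skew in view_count.
demo = pd.DataFrame({
    "view_count": [2000, 3000, 4000, 5000, 40000],
    "Likes":      [120,  90,   150,  110,  600],
})

pearson = demo.corr()            # default method is Pearson
spearman = demo.rank().corr()    # Pearson on ranks equals Spearman's rho

print(round(pearson.loc["view_count", "Likes"], 3))
print(round(spearman.loc["view_count", "Likes"], 3))
```

In this toy case the single large post pushes Pearson close to 1 while the rank correlation is only 0.5. Whether the real data behaves similarly would need to be checked on `df[columns]`.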

Q6: What is the engagement rate per impression for each post?¶

In [36]:
# load needed modules
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt
In [37]:
df['engagement rate (%)']=((df['Likes']+df['Comments']+df['Saves']+df['Shares'])/df['view_count'])*100
print(df[['view_count','Likes','Comments','Shares','Saves','engagement rate (%)']])
     view_count  Likes  Comments  Shares  Saves  engagement rate (%)
0          3920    162         9       5     98             6.989796
1          5394    224         7      14    194             8.138673
2          4021    131        11       1     41             4.575976
3          4528    213        10       7    172             8.878092
4          2518    123         5       4     96             9.054805
..          ...    ...       ...     ...    ...                  ...
114       13700    373         2      38    573             7.197080
115        5731    148         4       1    135             5.025301
116        4139     92         0       1     36             3.116695
117       32695    549         2      75   1095             5.263802
118       36919    443         5      26    653             3.052629

[102 rows x 6 columns]
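
To make the per-post rates easier to compare, they can be sorted. The sketch below rebuilds the formula on rows 0 to 4 copied from the printed output and ranks them; the variable names are illustrative.

```python
import pandas as pd

# Rows 0-4 copied from the printed output above.
first_rows = pd.DataFrame({
    "view_count": [3920, 5394, 4021, 4528, 2518],
    "Likes":      [162, 224, 131, 213, 123],
    "Comments":   [9, 7, 11, 10, 5],
    "Shares":     [5, 14, 1, 7, 4],
    "Saves":      [98, 194, 41, 172, 96],
})

interactions = first_rows[["Likes", "Comments", "Shares", "Saves"]].sum(axis=1)
first_rows["engagement rate (%)"] = interactions / first_rows["view_count"] * 100

# Rank posts from most to least engaging per impression.
ranked = first_rows.sort_values("engagement rate (%)", ascending=False)
print(ranked[["view_count", "engagement rate (%)"]].round(2))
```

Among these five rows the post with the fewest views (2518) has the highest rate. In the full output, the most viewed post (36919 views) has a rate of about 3.05%, which is also the minimum reported by `describe()`.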

Q7: Is there a relationship between the number of hashtags used and the impressions from hashtags?¶

In [41]:
# load needed modules
import seaborn as sns
import matplotlib.pyplot as plt
In [48]:
plt.figure(figsize=(8, 6))
sns.regplot(data=df, x='Hashtag_count', y='From Hashtags')
plt.title("Relationship Between Hashtag Count and Impressions from Hashtags")
plt.xlabel("Number of Hashtags Used")
plt.ylabel("Impressions from Hashtags")
plt.show()
[Figure: regression plot of Hashtag_count (x) against From Hashtags (y)]

Do not rely only on the number of hashtags; also consider their quality and relevance to the content.

(It is better to choose more specific and strongly related hashtags.)
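
One hedged way to look at hashtag relevance rather than hashtag count is to average the hashtag impressions of the posts each tag appears in. The sketch below uses made-up posts; the `tag` column and the `per_tag` table are illustrative names.

```python
import pandas as pd

# Made-up posts; 'From Hashtags' and 'Hashtags' mirror the dataset's column names.
demo = pd.DataFrame({
    "From Hashtags": [1000, 3000, 500],
    "Hashtags": ["#data #python", "#python #ml", "#data"],
})

# One row per (post, hashtag) pair.
tags = demo.assign(tag=demo["Hashtags"].str.findall(r"#\w+")).explode("tag")

# Average hashtag impressions of the posts each tag appears in.
per_tag = tags.groupby("tag")["From Hashtags"].agg(["mean", "count"])
print(per_tag.sort_values("mean", ascending=False))
```

With only 102 posts, tags that appear in very few posts give noisy averages, so the `count` column should be read alongside the mean.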

Conclusion¶

The raw dataset has 119 rows and 13 columns; 102 rows remain after removing 17 duplicates.

There is no meaningful correlation between view_count and Comments (-0.0085): comments do not necessarily increase with views.

There is a strong positive correlation between the number of likes and the number of new followers in this dataset.

There is a strong positive correlation between the number of saves and the number of new followers.

There is also a positive but weaker correlation between saves and profile visits.

Strong correlation between view_count and Likes (0.85): posts with more views generally receive more likes.

Moderate correlation between view_count and Shares (0.65): posts with more views tend to be shared more.

Likes and Shares are highly correlated (0.72): posts that receive many likes also tend to be shared often.

Comments are not strongly correlated with any other metric in the matrix (the largest value is 0.16, with Likes).

Do not rely only on the number of hashtags; also consider their quality and relevance to the content. It is better to choose more specific and strongly related hashtags.
